Katie Mazaitis* and Ryan Tibshirani†
*†Machine Learning and †Statistics
Carnegie Mellon University
†Amazon Scholar, AWS Labs
September 1, 2020
The COVIDcast project has many parts:
Today: COVIDcast API and client access in R. Outline:
Note: examples are meant to be demos, all code included
Next talks: Facebook surveys, medical claims data, etc.
How many people have died from COVID-19 per day, in my state, since March 1?
library(covidcast)
deaths = covidcast_signal(data_source = "usa-facts",
                          signal = "deaths_7dav_incidence_num",
                          start_day = "2020-03-01", end_day = "2020-08-30",
                          geo_type = "state", geo_values = "pa")
plot(deaths, plot_type = "line",
     title = "COVID-19 deaths in PA (7-day trailing average)")

What percentage of daily hospital admissions are due to COVID-19 in PA, NY, TX?
hosp = covidcast_signal(data_source = "hospital-admissions",
                        signal = "smoothed_adj_covid19",
                        start_day = "2020-03-01", end_day = "2020-08-28",
                        geo_type = "state", geo_values = c("pa", "ny", "tx"))
plot(hosp, plot_type = "line",
     title = "% of hospital admissions due to COVID-19")

What does the current COVID-19 incident case rate look like, nationwide?
cases = covidcast_signal(data_source = "usa-facts",
                         signal = "confirmed_7dav_incidence_prop",
                         start_day = "2020-08-30", end_day = "2020-08-30")
plot(cases, title = "Daily new COVID-19 cases per 100,000 people")

What does the current COVID-19 cumulative case rate look like, nationwide?
cases = covidcast_signal(data_source = "usa-facts",
                         signal = "confirmed_cumulative_prop",
                         start_day = "2020-08-30", end_day = "2020-08-30")
plot(cases, title = "Cumulative COVID-19 cases per 100,000 people",
     choro_params = list(legend_n = 6))

Where is the current COVID-19 cumulative case rate greater than 2%?
plot(cases, choro_col = c("#D3D3D3", "#FFC0CB"),
     title = "Cumulative COVID-19 cases per 100,000 people",
     choro_params = list(breaks = c(0, 2000), legend_width = 5))

How do some major cities compare in terms of doctor’s visits due to COVID-like illness?
dv = covidcast_signal(data_source = "doctor-visits",
                      signal = "smoothed_adj_cli",
                      start_day = "2020-03-01", end_day = "2020-08-28",
                      geo_type = "msa",
                      geo_values = name_to_cbsa(c("Pittsburgh", "New York",
                                                  "San Antonio", "Miami")))
plot(dv, plot_type = "line",
     title = "% of doctor's visits due to COVID-like illness")

How do my county and my friend’s county compare in terms of people reporting that they know somebody with COVID symptoms?
sympt = covidcast_signal(data_source = "fb-survey",
                         signal = "smoothed_hh_cmnty_cli",
                         start_day = "2020-04-15", end_day = "2020-08-30",
                         geo_values = c(name_to_fips("Allegheny"),
                                        name_to_fips("Fulton", state = "GA")))
plot(sympt, plot_type = "line", range = range(sympt$value),
     title = "% of people who know somebody with COVID symptoms")

The COVIDcast API is based on HTTP GET queries and returns data in JSON form. The base URL is https://api.covidcast.cmu.edu/epidata/api.php?source=covidcast
| Parameter | Description | Examples |
|---|---|---|
| data_source | data source | doctor-visits or fb-survey |
| signal | signal derived from data source | smoothed_cli or smoothed_adj_cli |
| time_type | temporal resolution of the signal | day or week |
| geo_type | spatial resolution of the signal | county, hrr, msa, or state |
| time_values | time units over which events happened | 20200406 or 20200406-20200410 |
| geo_value | location codes, depending on geo_type | * for all, or pa for Pennsylvania |
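As a minimal sketch of how the parameters in the table combine into a query, the base-R snippet below joins them onto the base URL as a query string. The helper names `api_date` and `build_query` are hypothetical (they are not part of the covidcast package), and the parameter values are the Allegheny County (FIPS 42003) settings used in this talk.

```r
# Hypothetical helper: format a Date the way the API expects (YYYYMMDD)
api_date = function(d) format(as.Date(d), "%Y%m%d")

# Hypothetical helper: join name=value pairs onto the base URL
build_query = function(base_url, params) {
  values = vapply(params, function(p) {
    utils::URLencode(as.character(p), reserved = TRUE)
  }, character(1))
  paste0(base_url, "?", paste(names(params), values, sep = "=", collapse = "&"))
}

params = list(source = "covidcast",
              data_source = "fb-survey",
              signal = "raw_cli",
              time_type = "day",
              geo_type = "county",
              time_values = api_date("2020-04-06"),
              geo_value = "42003")
url = build_query("https://api.covidcast.cmu.edu/epidata/api.php", params)
print(url)
```

A date range such as 20200406-20200410 could be produced the same way, by pasting two `api_date()` results together with a hyphen.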
Estimated % COVID-like illness on April 6, 2020 from the Facebook survey, in Allegheny County: https://api.covidcast.cmu.edu/epidata/api.php?source=covidcast&data_source=fb-survey&signal=raw_cli&time_type=day&geo_type=county&time_values=20200406&geo_value=42003
library(jsonlite)
res = readLines("https://api.covidcast.cmu.edu/epidata/api.php?source=covidcast&data_source=fb-survey&signal=raw_cli&time_type=day&geo_type=county&time_values=20200406&geo_value=42003")
prettify(res)
## {
##     "result": 1,
##     "epidata": [
##         {
##             "geo_value": "42003",
##             "signal": "raw_cli",
##             "time_value": 20200406,
##             "direction": null,
##             "issue": 20200903,
##             "lag": 150,
##             "value": 0.7614984,
##             "stderr": 0.3826746,
##             "sample_size": 434.8891
##         }
##     ],
##     "message": "success"
## }
##
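To work with a response like this one, it can help to turn the `epidata` array into a data frame and the integer dates into `Date` objects. The sketch below does that offline, on a copy of the JSON shown above rather than a live query. The helper `to_date` is hypothetical, and treating `result` equal to 1 as the success case is an assumption based only on this one response.

```r
library(jsonlite)

# Copy of the response shown above, so the example runs without network access
res_json = '{"result": 1, "epidata": [{"geo_value": "42003",
  "signal": "raw_cli", "time_value": 20200406, "direction": null,
  "issue": 20200903, "lag": 150, "value": 0.7614984, "stderr": 0.3826746,
  "sample_size": 434.8891}], "message": "success"}'

res = fromJSON(res_json)
stopifnot(res$result == 1)   # assumption: 1 signals success, as in this response

# The array of records is simplified to a one-row data frame
df = res$epidata

# Hypothetical helper: convert YYYYMMDD integers to Date objects
to_date = function(x) as.Date(as.character(x), format = "%Y%m%d")
df$time_value = to_date(df$time_value)
df$issue = to_date(df$issue)

# The reported lag is the number of days between the event and the issue date
days_between = as.integer(df$issue - df$time_value)
print(days_between)   # 150, matching the "lag" field
```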
For full details, see the API documentation site. There you’ll also find details on:
By default the API returns the most recent data for each time_value. We also provide access to all previous versions of the data, using the following optional parameters:
| Parameter | To get data … | Examples |
|---|---|---|
| as_of | as if we queried the API on a particular date | 20200406 |
| issues | published at a particular date or date range | 20200406 or 20200406-20200410 |
| lag | published a certain number of time units after events occurred | 1 or 3 |
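To make the three options concrete, here is a toy sketch in base R. The revision history is made up, and the filtering logic is only one reading of the descriptions in the table (in particular, it assumes `as_of` includes issues published on the given date); it is not the API's implementation. The helper `latest_as_of` is hypothetical.

```r
# Made-up revision history for one location: each row is one published version
versions = data.frame(
  time_value = as.Date(c("2020-06-01", "2020-06-01", "2020-06-01",
                         "2020-06-02", "2020-06-02")),
  issue = as.Date(c("2020-06-03", "2020-06-08", "2020-06-15",
                    "2020-06-04", "2020-06-10")),
  value = c(1.0, 1.4, 1.5, 2.0, 2.3)
)
# Lag: days between when events happened and when the version was published
versions$lag = as.integer(versions$issue - versions$time_value)

# Hypothetical helper: most recent version of each time_value known by `as_of`
latest_as_of = function(df, as_of) {
  df = df[df$issue <= as.Date(as_of), ]
  df = df[order(df$time_value, df$issue), ]
  df[!duplicated(df$time_value, fromLast = TRUE), ]
}

snapshot = latest_as_of(versions, "2020-06-09")               # like as_of
one_issue = versions[versions$issue == as.Date("2020-06-08"), ]  # like issues
fixed_lag = versions[versions$lag == 2, ]                     # like lag
print(snapshot)
```

In this toy history, the snapshot as of June 9 holds the values 1.4 and 2.0, while the latest versions are 1.5 and 2.3; a forecaster working on June 9 would only have seen the former.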
Why would we need this? Because many data sources are subject to revisions:
This presents a challenge to modelers: e.g., we have to learn how to forecast based on the data we’d have at the time, not updates that would arrive later
To accommodate, we log revisions even when the original data source does not!
We also provide an R package called covidcast for API access. Highlights:
Still highly under development … much more to come. For now, check out our vignettes:
(Or, you can file an issue or contribute a pull request on our public GitHub repo!)
Let’s examine the revisions or “backfill” present in our doctor’s visits signal. We’ll look at this signal over the month of June, and query the API “as of” each week from June 8 through August 1:
library(covidcast)
library(purrr)    # for map_dfr()
library(ggplot2)

# Loop over "as of" dates, fetch data from the API for each one
as_ofs = seq(as.Date("2020-06-08"), as.Date("2020-08-01"), by = "week")
states = c("az", "ca", "pa", "ny")
dv_as_of = map_dfr(as_ofs, function(as_of) {
  covidcast_signal(data_source = "doctor-visits", signal = "smoothed_adj_cli",
                   start_day = "2020-06-01", end_day = "2020-06-30",
                   geo_type = "state", geo_values = states, as_of = as_of)
})
dv_as_of$geo_value = factor(dv_as_of$geo_value, levels = states,
                            labels = abbr_to_name(states, ignore.case = TRUE))

# Now plot each "as of" time series curve, faceted by state
ggplot(dv_as_of, aes(x = time_value, y = value)) +
  geom_line(aes(color = factor(issue))) + facet_wrap(vars(geo_value)) +
  labs(color = "Issue date", x = "Date", y = "% doctor's visits due to CLI") +
  theme_bw() + theme(legend.position = "bottom")

Now let’s examine the correlations between COVID-19 cases and deaths, per day, across counties. We’ll look at Spearman correlation, starting March 1. Then we’ll repeat the calculation with cases lagged back 7 days:
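Before the real data, a toy sketch in base R may help show what "correlation per day, across locations" means, and why lagging cases can matter. All numbers here are made up: deaths are constructed as a fixed fraction of the cases from 7 days earlier. Pairing cases at day t - 7 with deaths at day t is an assumed reading of "lagged back 7 days", and this is not how `covidcast_cor` is implemented.

```r
# Toy data (made up): 5 locations (rows) by 20 days (columns)
n_loc = 5
n_day = 20
shift = 7
cases = outer(1:n_loc, 1:n_day, function(i, t) 50 + 40 * sin(t / 3 + i))

# Deaths on day t are a fixed fraction of cases on day t - 7 (NA before that)
deaths = matrix(NA_real_, n_loc, n_day)
deaths[, (shift + 1):n_day] = 0.02 * cases[, 1:(n_day - shift)]

# For each day, correlate the two signals across locations
days = (shift + 1):n_day
cor_same_day = sapply(days, function(t) {
  cor(cases[, t], deaths[, t], method = "spearman")
})
cor_lagged = sapply(days, function(t) {
  cor(cases[, t - shift], deaths[, t], method = "spearman")
})
print(round(rbind(same_day = cor_same_day, lagged = cor_lagged), 2))
```

By construction, the lagged correlations are all 1 here, while the same-day correlations vary; real data will of course be far noisier.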
library(covidcast)
library(dplyr)    # for %>%, filter(), pull()
library(ggplot2)

# Fetch confirmed cases and deaths, at the county level, since March 1
start_day = "2020-03-01"
end_day = "2020-08-30"
cases = covidcast_signal("usa-facts", "confirmed_7dav_incidence_num",
                         start_day, end_day)
deaths = covidcast_signal("usa-facts", "deaths_7dav_incidence_num",
                          start_day, end_day)

# Consider only "active" counties with at least 500 cumulative cases so far
case_num = 500
geo_values = covidcast_signal("usa-facts", "confirmed_cumulative_num",
                              max(cases$time_value), max(cases$time_value)) %>%
  filter(value >= case_num) %>% pull(geo_value)
cases_act = cases %>% filter(geo_value %in% geo_values)
deaths_act = deaths %>% filter(geo_value %in% geo_values)

# Compute correlations, per time, over all counties. Both with original time
# alignment, and with cases lagged backwards in time by 7 days
df_cor1 = covidcast_cor(cases_act, deaths_act, by = "time_value",
                        method = "spearman")
df_cor2 = covidcast_cor(cases_act, deaths_act, by = "time_value",
                        method = "spearman", dt_x = -7)

# Stack rowwise into one data frame, then plot time series
df_cor = rbind(df_cor1, df_cor2)
df_cor$dt = factor(c(rep(0, nrow(df_cor1)), rep(-7, nrow(df_cor2))))
ggplot(df_cor, aes(x = time_value, y = value)) +
  geom_line(aes(color = dt)) +
  labs(title = "Correlation between cases and deaths",
       subtitle = sprintf("Over counties with at least %i cases", case_num),
       x = "Date", y = "Correlation") +
  theme_bw() + theme(legend.position = "bottom")

Go to: https://covidcast.cmu.edu … you’ll find everything linked from there!
## A `covidcast_meta` data frame with 322 rows and 15 columns.
##
## Number of data sources : 11
## Number of signals : 88
##
## Summary:
##
## data_source signal county msa hrr state
## doctor-visits smoothed_adj_cli * * * *
## doctor-visits smoothed_cli * * * *
## fb-survey raw_cli * * * *
## fb-survey raw_hh_cmnty_cli * * * *
## fb-survey raw_ili * * * *
## fb-survey raw_nohh_cmnty_cli * * * *
## fb-survey raw_wcli * * * *
## fb-survey raw_whh_cmnty_cli * * * *
## fb-survey raw_wili * * * *
## fb-survey raw_wnohh_cmnty_cli * * * *
## fb-survey smoothed_cli * * * *
## fb-survey smoothed_hh_cmnty_cli * * * *
## fb-survey smoothed_ili * * * *
## fb-survey smoothed_nohh_cmnty_cli * * * *
## fb-survey smoothed_wcli * * * *
## fb-survey smoothed_whh_cmnty_cli * * * *
## fb-survey smoothed_wili * * * *
## fb-survey smoothed_wnohh_cmnty_cli * * * *
## ght raw_search * * *
## ght smoothed_search * * *
## google-survey raw_cli * * * *
## google-survey smoothed_cli * * * *
## hospital-admissions smoothed_adj_covid19 * * * *
## hospital-admissions smoothed_covid19 * * * *
## indicator-combination confirmed_7dav_cumulative_num * * * *
## indicator-combination confirmed_7dav_cumulative_prop * * * *
## indicator-combination confirmed_7dav_incidence_num * * * *
## indicator-combination confirmed_7dav_incidence_prop * * * *
## indicator-combination confirmed_cumulative_num * * * *
## indicator-combination confirmed_cumulative_prop * * * *
## indicator-combination confirmed_incidence_num * * * *
## indicator-combination confirmed_incidence_prop * * * *
## indicator-combination deaths_7dav_cumulative_num * * * *
## indicator-combination deaths_7dav_cumulative_prop * * * *
## indicator-combination deaths_7dav_incidence_num * * * *
## indicator-combination deaths_7dav_incidence_prop * * * *
## indicator-combination deaths_cumulative_num * * * *
## indicator-combination deaths_cumulative_prop * * * *
## indicator-combination deaths_incidence_num * * * *
## indicator-combination deaths_incidence_prop * * * *
## indicator-combination nmf_day_doc_fbc_fbs_ght * * *
## indicator-combination nmf_day_doc_fbs_ght * * *
## jhu-csse confirmed_7dav_cumulative_num * * * *
## jhu-csse confirmed_7dav_cumulative_prop * * * *
## jhu-csse confirmed_7dav_incidence_num * * * *
## jhu-csse confirmed_7dav_incidence_prop * * * *
## jhu-csse confirmed_cumulative_num * * * *
## jhu-csse confirmed_cumulative_prop * * * *
## jhu-csse confirmed_incidence_num * * * *
## jhu-csse confirmed_incidence_prop * * * *
## jhu-csse deaths_7dav_cumulative_num * * * *
## jhu-csse deaths_7dav_cumulative_prop * * * *
## jhu-csse deaths_7dav_incidence_num * * * *
## jhu-csse deaths_7dav_incidence_prop * * * *
## jhu-csse deaths_cumulative_num * * * *
## jhu-csse deaths_cumulative_prop * * * *
## jhu-csse deaths_incidence_num * * * *
## jhu-csse deaths_incidence_prop * * * *
## quidel covid_ag_raw_pct_positive * * * *
## quidel covid_ag_smoothed_pct_positive * * * *
## quidel raw_pct_negative * *
## quidel raw_tests_per_device * *
## quidel smoothed_pct_negative * *
## quidel smoothed_tests_per_device * *
## safegraph completely_home_prop * *
## safegraph full_time_work_prop * *
## safegraph median_home_dwell_time * *
## safegraph part_time_work_prop * *
## usa-facts confirmed_7dav_cumulative_num * * * *
## usa-facts confirmed_7dav_cumulative_prop * * * *
## usa-facts confirmed_7dav_incidence_num * * * *
## usa-facts confirmed_7dav_incidence_prop * * * *
## usa-facts confirmed_cumulative_num * * * *
## usa-facts confirmed_cumulative_prop * * * *
## usa-facts confirmed_incidence_num * * * *
## usa-facts confirmed_incidence_prop * * * *
## usa-facts deaths_7dav_cumulative_num * * * *
## usa-facts deaths_7dav_cumulative_prop * * * *
## usa-facts deaths_7dav_incidence_num * * * *
## usa-facts deaths_7dav_incidence_prop * * * *
## usa-facts deaths_cumulative_num * * * *
## usa-facts deaths_cumulative_prop * * * *
## usa-facts deaths_incidence_num * * * *
## usa-facts deaths_incidence_prop * * * *
## youtube-survey raw_cli *
## youtube-survey raw_ili *
## youtube-survey smoothed_cli *
## youtube-survey smoothed_ili *